Abstract

We use the susceptible-infected-recovered (SIR) model for disease spread over a network, and
empirically study how well various centrality measures perform at identifying which nodes in a network
will be the best spreaders of disease on 10 real-world networks. We find that the relative performance
of degree, shell number and other centrality measures can be sensitive to β, the probability that an
infected node will transmit the disease to a susceptible node. We also find that eigenvector centrality
performs very well in general for values of β above the epidemic threshold.



1 Introduction 

The susceptible-infected-recovered (SIR) model, first introduced in Anderson & May (1979),
is a popular model for disease spread. In recent years, this model has been applied to social
networks, that is, situations where the interactions of individuals are modeled as a graph. A key
problem relating to this model when considering a network structure is how to identify 
the nodes that, if initially infected, will result in the greatest portion of the population 
(in expectation) also becoming infected. These nodes are often referred to as "spreaders." 
Unfortunately, a modification of the proof of a related problem in Chen et al. (2010)
shows that exactly computing the expected number of infected individuals in a
network-structured population given a single initially infected node is #P-hard. This implies that solving
this problem exactly is likely to be computationally intractable. However,
the literature on complex networks has provided various centrality measures that can be 
used as heuristics. So, inspired by the work of Kitsak et al. (2010), which empirically 
examines the use of degree, betweenness, and shell number for identifying spreaders, we 
conduct a comprehensive evaluation of 10 different centrality measures on 10 real-world
social network datasets from various domains (e-mail, disease spread, blogging, power,
autonomous system, and collaboration). The major contributions of our work are twofold.
First, we show that the ability of a centrality measure to identify spreaders in the
SIR model can be sensitive to the β parameter, the probability of infection. Second, we
find that, in general, eigenvector centrality performs very well for values of β above the
epidemic threshold.

Fig. 1. Imprecision versus p for the cond-mat network with β = 11.17. Notice that for this
β, k-shell has a lower imprecision, meaning that k-shell outperforms degree. See Section 3
for the definitions of the imprecision function and p.

With respect to our first major contribution, we carefully selected the β parameter based
on β', the epidemic threshold of the network. We can be sure that a contagion can spread to
a significant portion of the network for β > β', and we studied a variety of different values
for β above this threshold.

In Figures 1 and 2, we give an example of a network where shell number outperforms
degree for one value of β, but degree outperforms shell number for another value of β. In
Section 5, we give additional examples illustrating that the imprecision functions of other
centrality measures, as well as the choice of the "best" centrality measure, can be sensitive
to β as well.

As for our second major contribution, we found that eigenvector centrality consistently 
outperformed all other measures considered, including both shell number and degree (which 
were considered by Kitsak et al.), in all but one of the networks examined. See Figure 3 
for a comparison of k-shell (the best performing centrality measure of Kitsak et al.) with
eigenvector centrality. Also, if we average over all of our networks, including the one where 
eigenvector was not the best, we find that, on average, eigenvector centrality outperforms 
the other measures. 

The rest of this paper is organized as follows. In Section 2, we review the SIR model, 
discuss how the #P-hardness proof of Chen et al. (2010) applies to this model, and describe 
how we calculate the epidemic threshold of a given complex network. This is followed by
a discussion of the various centrality measures we considered in Section 3, along with a
review of the "imprecision function" that Kitsak et al. (2010) used to measure
the effectiveness of a centrality measure in identifying the top spreaders in a network.
We describe our experimental setup and datasets in Section 4 and give a description and
discussion of the experimental results in Section 5.

Fig. 2. Imprecision versus p for the cond-mat network with β = 15.95. Notice that
for this β, degree has a lower imprecision, meaning that degree outperforms k-shell, the
opposite of what we saw in Figure 1.

2 The SIR Model 

As in Kitsak et al. (2010), we consider the classic susceptible-infected-recovered (SIR) 
model of disease spread introduced in Anderson & May (1979). In this model, all nodes 
in the network are in one of three states: susceptible (able to be infected), infected, or 
recovered (no longer able to infect or be infected). At each time step, any node infected 
in the last time step can infect any of its neighbors who are in a susceptible state with a 
probability β. After that time step, any node previously in an infected state moves into a
recovered state and is no longer able to infect or be infected. 
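
As a rough illustration of the dynamics just described, the following Python sketch simulates one outbreak from a single seed node and estimates the expected outbreak size by repeated simulation. The function names, the adjacency-dictionary representation, and the default number of runs are assumptions made for this example (the experiments in this paper were run in R). Here `beta` is a probability in [0, 1]; the β values quoted in this paper, such as 11.17, appear to be percentages and would need to be divided by 100 under that reading.

```python
import random

def simulate_sir(adj, seed_node, beta, rng):
    """Run one SIR outbreak and return the number of nodes ever infected.

    adj: dict mapping each node to an iterable of its neighbors.
    beta: per-contact infection probability in [0, 1].
    rng: a random.Random instance (passed in for reproducibility).
    """
    recovered = set()
    infected = {seed_node}
    while infected:
        newly_infected = set()
        for u in infected:
            for v in adj[u]:
                # Only susceptible nodes can be infected.
                if v in infected or v in recovered or v in newly_infected:
                    continue
                if rng.random() < beta:
                    newly_infected.add(v)
        # Nodes stay infected for exactly one time step, then recover.
        recovered |= infected
        infected = newly_infected
    return len(recovered)

def expected_spread(adj, seed_node, beta, runs=1000, seed=0):
    """Monte Carlo estimate of the expected outbreak size from seed_node."""
    rng = random.Random(seed)
    total = sum(simulate_sir(adj, seed_node, beta, rng) for _ in range(runs))
    return total / runs
```

Because each infected node gets exactly one chance to infect each susceptible neighbor, the estimate from `expected_spread` is the quantity that the centrality measures in this paper are meant to rank.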

2.1 Complexity 

In Goldenberg et al. (2001) and Kempe et al. (2003), the authors present a generalization of the
SIR model known as the independent cascade (IC) model. In this model, the β parameter
can be different for each edge in the network. They define the influence spread of a set of
nodes as the expected number of individuals in the population infected under the IC model
given that the set was initially infected. In Chen et al. (2010), computing this quantity was shown to be
#P-hard. Here we reconsider their proof, with some modification, for the influence
spread of a single node under the SIR model.

Fig. 3. Imprecision of k-shell minus the imprecision of eigenvector centrality, plotted versus β for p = 5. Positive
values indicate that k-shell has a higher imprecision than eigenvector centrality, which
means that eigenvector centrality typically outperforms k-shell.

Theorem 2.1

Calculating the influence spread of a single node under the SIR model is #P-hard.

Proof

We prove this theorem by a reduction from the known #P-complete problem of s-t
connectivity (Valiant, 1979). Let G = (V,E) be a directed graph, where V denotes the set of
vertices and E denotes the set of edges. Given two vertices s,t ∈ V, the goal is to determine
the number of subgraphs of G in which s is connected to t. In Chen et al. (2010), the authors
point out that this is equivalent to computing the probability that s is
connected to t when each edge in G is independently present with probability 0.5
(and absent with probability 0.5); the subgraph count is this probability multiplied by 2^|E|. Hence, to embed the s-t connectivity problem into the
influence spread problem on the SIR model, we first calculate $M_s$, the expected number of infected nodes
given initially infected node s with β = 0.5. We then create G', which is identical to G but
has an additional directed edge from t to a new node t'. Let $M'_s$ be the influence spread of s when
we consider graph G'. If p(s,t,G) is the probability that t is infected by s in G (hence the
solution to the s-t connectivity problem), then $M'_s = M_s + \beta \cdot p(s,t,G)$, because t' can only be infected through t. Therefore, the
solution to the s-t connectivity problem can easily be obtained in polynomial time if we
can efficiently find a solution to the influence spread problem under the SIR model. □
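
To make the quantities in the proof concrete, the sketch below computes the influence spread of a single node exactly by enumerating every subset of "live" edges, which takes time exponential in |E| and is only feasible for toy graphs. It relies on the standard observation that, because each edge is tried at most once, the set of nodes eventually infected equals the set reachable from s over independently retained edges. The function names and the small example graph are hypothetical and chosen only for illustration.

```python
from itertools import product

def reachable(edges, source):
    """Return the set of nodes reachable from source along directed edges."""
    out = {}
    for u, v in edges:
        out.setdefault(u, []).append(v)
    seen, stack = {source}, [source]
    while stack:
        u = stack.pop()
        for v in out.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def exact_spread(edges, s, beta, target=None):
    """Return (expected number infected from s, P(target is infected)).

    Enumerates all 2^|E| live-edge subgraphs, each edge being kept
    independently with probability beta.
    """
    spread, hit = 0.0, 0.0
    for keep in product([False, True], repeat=len(edges)):
        live = [e for e, k in zip(edges, keep) if k]
        prob = 1.0
        for k in keep:
            prob *= beta if k else 1.0 - beta
        reached = reachable(live, s)
        spread += prob * len(reached)
        if target in reached:
            hit += prob
    return spread, hit

# Toy instance: G has three directed edges; G' adds the edge (t, t2).
G = [("s", "a"), ("a", "t"), ("s", "t")]
M_s, p_st = exact_spread(G, "s", 0.5, target="t")
M_s_prime, _ = exact_spread(G + [("t", "t2")], "s", 0.5)
```

On this toy instance the difference `M_s_prime - M_s` equals `0.5 * p_st`, which is how the connectivity probability is read off from two influence-spread computations in the reduction.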

Theorem 2.1 tells us that exactly computing the influence spread of individual
nodes under the SIR model is likely to be computationally intractable. Further,
as s-t connectivity has no known efficient approximation algorithm with a guarantee
of accuracy, an approximation scheme for influence spread also seems unlikely. Hence,
much work on influence spread, such as Kempe et al. (2003), relies on estimating influence
spread using simulation, which is often computationally expensive and even impractical
for very large networks. Therefore, in this paper, we evaluate various centrality
measures from the literature as heuristics to identify spreaders under the SIR model. We
describe these centrality measures in Section 3. Note that the centrality measures are not
specifically designed to calculate influence spread under the SIR model, and they do not
account for the infection probability β. In the next section, we describe how we select the
different β parameters for the model in our experiments.



2.2 Selecting the Infection Probability 

We note that for scale-free networks, having degree distribution P(k) ∼ k^(−γ), the literature
shows that for γ < 3, the epidemic threshold of β approaches 0 as the number of nodes goes
to infinity (Callaway et al., 2000; Cohen et al., 2000). However, the networks we examine
are of finite size and have various levels of "scale-freeness", based on the R² value of the
linear correlation of a log-log plot of the degree distribution (see Section 4 for details).
Instead, we explored β values based on the epidemic threshold calculation in Madar et al.
(2004). Using this method, the SIR model is mapped onto a bond percolation process.
Assuming a randomly connected network, the average number of influenced neighbors,
⟨n⟩, can be written

$$\langle n \rangle = \beta \sum_k \frac{P(k)\,k\,(k-1)}{\langle k \rangle} = \beta\,\frac{\langle k^2 \rangle - \langle k \rangle}{\langle k \rangle}, \qquad (1)$$

where k is the degree of a node, P(k) is the probability of a node having degree k, and ⟨k⟩
is the average degree. Since an epidemic state can only be reached when ⟨n⟩ > 1, from
(1) we have

$$\beta > \beta' = \frac{\langle k \rangle}{\langle k^2 \rangle - \langle k \rangle}. \qquad (2)$$
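
The threshold calculation can be sketched in a few lines of Python. This example assumes the threshold takes the standard bond-percolation form β' = ⟨k⟩ / (⟨k²⟩ − ⟨k⟩), which is consistent with the degree moments and β' values reported in Table 1 when β' is read as a percentage; the function names are hypothetical.

```python
def threshold_from_moments(k1, k2):
    """Epidemic threshold from the first and second degree moments.

    k1: mean degree <k>; k2: mean squared degree <k^2>.
    Undefined (division by zero) when every node has degree 1.
    """
    return k1 / (k2 - k1)

def epidemic_threshold(degrees):
    """Epidemic threshold computed directly from a list of node degrees."""
    n = len(degrees)
    k1 = sum(degrees) / n
    k2 = sum(d * d for d in degrees) / n
    return threshold_from_moments(k1, k2)

# Using the netscience-GCC moments from Table 1 (<k> = 4.8, <k^2> = 38.7):
netscience_threshold_percent = 100 * threshold_from_moments(4.8, 38.7)
```

With the netscience-GCC moments the result rounds to 14.2, matching the β' entry for that network in Table 1.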

We note that there is some work discussing the effect of different infection probabilities 
on spreading in Kitsak et al. (2010) and a more recent and comprehensive study on the topic
in Castellano & Pastor-Satorras (2012). These works consider the effect of this parameter 
with respect to degree and shell decomposition (and betweenness in Kitsak et al. (2010)). 
Here we consider these and many other centrality measures, and find that some of them, 
such as eigenvector centrality, outperform those in these previous works. 



3 Centrality Measures 

We now describe the centrality measures that we examine in our experiments. We note 
that the major centrality measures in the literature can be classified as either radial (the 
quantity of certain paths originating from the node) or medial (the quantity of certain paths 
passing through the node), as done in Borgatti & Everett (2006). Based
on the negative result concerning betweenness of Kitsak et al. (2010) and the intuitive 
association between high-radial nodes and spreading, we focused our efforts on radial 
measures. While the work of Kitsak et al. (2010) compares shell number to degree and 
betweenness, we consider several other well-known radial measures in addition to degree, 
including closeness and eigenvector centrality. As done in Kitsak et al. (2010), we also
develop "imprecision functions" for these centrality measures. 

3.1 Degree Centrality 

Of all the measures that we examine, degree is perhaps the simplest:
it is the number of edges incident to a given node. As noted throughout the literature, such
as Wasserman & Faust (1994), it is perhaps the easiest centrality measure to compute.
Further, in other diffusion processes, such as the voter model on undirected networks
in Antal et al. (2006), it has been shown to be proportional to the expected number of
individuals becoming infected¹. As pointed out in Borgatti & Everett (2006), degree is a
radial measure, as it is the number of paths of length 1 starting from a node. Degree is one
of three measures considered in Kitsak et al. (2010).

3.2 Shell Number 

The other radial measure considered in Kitsak et al. (2010), shell number, or "k-shell
number", is determined using shell decomposition (Seidman, 1983). High shell-number
nodes in the network are often referred to as the "core" and are regarded by Kitsak et al.
(2010) as influential spreaders under the SIR model. Our results described later in the
paper confirm this finding, although we also show that k-shell number was generally
outperformed by eigenvector centrality. There have also been some more practical applications
of this technique to find key nodes in a network. For instance, Borge-Holthoefer & Moreno
(2012) and Borge-Holthoefer et al. (2012) use shell decomposition to find individuals likely
to initiate information cascades in an online social network, while Carmi et al. (2007) use
it to identify key nodes in a subset of autonomous systems on the Internet.

An example of this process is shown in Figure 4. Given a graph G = (V,E), shell
decomposition partitions the graph into shells and is described in the algorithm below.

Let k_i be the degree of node i. Set S = 1. Let V_S denote shell S of G.
while |V| > 0 do
    while there exists i ∈ V such that k_i ≤ S do
        Remove all i ∈ V where k_i ≤ S;
        Also, remove all edges adjacent to the removed nodes and update the degrees k_i of the remaining nodes.
        Place the removed nodes into shell V_S.
    end while
    S++
end while
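
The peeling procedure above can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in our experiments: the function name and the adjacency-dictionary input are assumptions, and nodes are peeled when their current degree is at most S, so that a node whose degree drops below S during peeling is still assigned to shell S.

```python
def shell_numbers(adj):
    """Assign a shell number to every node by iterative peeling.

    adj: dict mapping each node to an iterable of its neighbors
         (undirected graph, so every edge appears in both directions).
    Returns a dict node -> shell number. With this loop, isolated nodes
    are placed in shell 1.
    """
    # Work on a private copy so the caller's graph is not modified.
    remaining = {v: set(nbrs) for v, nbrs in adj.items()}
    shell = {}
    s = 1
    while remaining:
        while True:
            # Nodes whose current degree is at most s are peeled together.
            peel = [v for v, nbrs in remaining.items() if len(nbrs) <= s]
            if not peel:
                break
            for v in peel:
                # Remove v and its edges, updating its neighbors' degrees.
                for u in remaining.pop(v):
                    if u in remaining:
                        remaining[u].discard(v)
                shell[v] = s
        s += 1
    return shell

# A triangle (0, 1, 2) with one pendant node (3) attached to node 0.
example = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
example_shells = shell_numbers(example)
```

In the example, the pendant node is removed in the first pass and lands in shell 1, while the three triangle nodes survive until S = 2 and form the second shell.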



¹ Technically, the work of Antal et al. (2006) proves that the fixation probability for a single mutant
invader is proportional to the degree of that node. However, the expected number of mutants, in
the limit as time goes to infinity, can simply be computed by multiplying the fixation probability by
the number of nodes in the network.

Fig. 4. Consider the progression of the graph above, where the elimination of nodes with
degree 1 occurs in B and C. D represents the first iteration for the second shell, and E
represents the complete second shell (as well as the first). F finalizes the decomposition
with the third shell.



3.3 Betweenness Centrality 

The intuition behind high betweenness centrality nodes is that they function as "bottlenecks",
as many paths in the network pass through them. Hence, betweenness is a medial
centrality measure. Let $\sigma_{st}$ be the number of shortest paths between nodes s and t and $\sigma_{st}(v)$
be the number of shortest paths between s and t containing node v. In Freeman (1977),
betweenness centrality for node v is defined as $\sum_{s \neq v \neq t} \sigma_{st}(v) / \sigma_{st}$. In most implementations,
including the ones used in this paper, the algorithm of Brandes (2001) is used to calculate
betweenness centrality.



3.4 Closeness Centrality 

Another common measure from the literature that we examined is closeness (Freeman,
1979). Given node i, its closeness $C_c(i)$ is the inverse of the average shortest path length
from node i to all other nodes in the graph. Intuitively, closeness measures how "close" a node
is to all other nodes in a graph.

Formally, if we denote the shortest path length between nodes i and j by $d_G(i,j)$, we can
express the average path length from i to all other nodes as

$$L_i = \frac{\sum_{j \in V \setminus i} d_G(i,j)}{|V| - 1}. \qquad (3)$$

Hence, the closeness of a node can be formally written as

$$C_c(i) = \frac{1}{L_i} = \frac{|V| - 1}{\sum_{j \in V \setminus i} d_G(i,j)}. \qquad (4)$$
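
For an unweighted, connected network, closeness as described above (the inverse of the average shortest path length from a node to all others) can be computed with one breadth-first search per node. The following Python sketch is illustrative only; the function names are hypothetical, and it assumes the graph is connected, as is the case for the connected components studied here.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest path lengths (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def closeness(adj, i):
    """Inverse of the average shortest path length from i to all other nodes.

    Assumes the graph is connected and has at least two nodes.
    """
    dist = bfs_distances(adj, i)
    total = sum(d for node, d in dist.items() if node != i)
    return (len(adj) - 1) / total
```

On a three-node path, the middle node has closeness 1 and each end node has closeness 2/3, reflecting that the middle node is one hop from everything.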



3.5 Eigenvector Centrality 

The use of the principal eigenvector of the adjacency matrix of a network was first proposed
as a centrality measure in Bonacich (1972). The intuition behind eigenvector
centrality is that it measures the influence of a node based on the sum of the influences of
its adjacent nodes. Given a network G = (V,E) with adjacency matrix $A = (a_{ij})$, where
$a_{ij} = 1$ if an edge exists between nodes i and j (and 0 otherwise), the eigenvector centrality of node i satisfies

$$x_i = \frac{1}{\lambda} \sum_{j \in V} a_{ij} x_j, \qquad (5)$$

for some λ. If we define x to be the vector of the $x_i$'s, this relationship can be expressed as

$$x = \frac{1}{\lambda} A x, \quad \text{or} \quad A x = \lambda x, \qquad (6)$$

which is the familiar equation relating A to its eigenvalues and eigenvectors. The eigenvector
centralities for the network are the entries of the eigenvector corresponding to the
largest real eigenvalue.
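
One common way to obtain the principal eigenvector is power iteration. The Python sketch below is an illustration under stated assumptions, not the routine used in our experiments (which relied on igraph): it iterates with A + I rather than A, a shift that leaves the eigenvectors unchanged but avoids oscillation on bipartite graphs, and it scales the result so that the largest entry is 1. The function name, iteration cap, and tolerance are hypothetical.

```python
def eigenvector_centrality(adj, iterations=200, tol=1e-10):
    """Approximate eigenvector centrality by power iteration on A + I.

    adj: dict mapping each node to an iterable of its neighbors
         (undirected, connected graph).
    Returns a dict node -> centrality, scaled so the maximum is 1.
    """
    x = {v: 1.0 for v in adj}
    for _ in range(iterations):
        # One multiplication by (A + I): own value plus neighbors' values.
        new = {v: x[v] + sum(x[u] for u in adj[v]) for v in adj}
        norm = max(new.values())
        new = {v: val / norm for v, val in new.items()}
        # Stop once successive iterates agree to within the tolerance.
        if max(abs(new[v] - x[v]) for v in adj) < tol:
            return new
        x = new
    return x
```

For a star with three leaves, the center receives centrality 1 and each leaf receives 1/√3, which satisfies the eigenvector relation with λ = √3.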



3.6 PageRank

PageRank, introduced in Page et al. (1998), is computed for each node based on the
PageRank of its neighbors. Where E is the set of undirected edges, $R_v$ and $d_v$ are the PageRank
and degree of v, and c is a normalization constant, we have the relationship

$$R_v = c \sum_{v' \mid (v,v') \in E} \frac{R_{v'}}{d_{v'}}.$$

An initial value for rank is entered for each node and the relationship is then computed
iteratively until convergence is reached. Intuitively, PageRank can be thought of as the
importance of a node based on the importance of its neighbors.
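
The iterative computation can be sketched as follows in Python. As an assumption not stated in the text, this example includes the damping term of Page et al. with a damping factor of 0.85, which guarantees convergence; setting `damping=1.0` recovers the plain relation between a node's rank and its neighbors' ranks divided by their degrees, although the undamped iteration can oscillate on bipartite graphs. The function name and the fixed iteration count are hypothetical.

```python
def pagerank(adj, damping=0.85, iterations=200):
    """Damped PageRank on an undirected graph by fixed-point iteration.

    adj: dict mapping each node to a list of its neighbors; every node
         is assumed to have at least one neighbor.
    Returns a dict node -> rank; the ranks sum to 1.
    """
    n = len(adj)
    rank = {v: 1.0 / n for v in adj}
    for _ in range(iterations):
        # Each neighbor u passes an equal share rank[u] / degree(u) to v.
        rank = {
            v: (1.0 - damping) / n
            + damping * sum(rank[u] / len(adj[u]) for u in adj[v])
            for v in adj
        }
    return rank
```

On a regular graph such as a triangle every node receives the same rank, while on a star the center accumulates the shares of all its leaves.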



3.7 Neighborhood 

The next centrality measure we consider is the "neighborhood." Given a natural number q,
the q-neighborhood of vertex i is the number of nodes in the network that are at distance
q or closer from node i. For example, for q = 0, this metric is 1 for every node. For
q = 1, this metric is equivalent to the degree centrality of node i, since it counts the
nodes within a distance 1 of i (the degree plus one for i itself, which gives the same ranking). For q = 2, this metric counts the number of nodes within a
distance 2 of i, so it counts i's neighbors along with its neighbors' neighbors. In our work,
we computed neighborhoods using q = 2, 3, 5, 10, and denoted these measures by nghd2,
nghd3, nghd5, and nghd10, respectively. We note that the work of Chen et al. (2012)
develops a centrality measure with a similar intuition to the neighborhood and shows that it
performs well in identifying influential spreaders.

empirStudySIRpreprint 23 August 2012 1:40

Network Science 9
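
The q-neighborhood can be computed with a breadth-first search that stops expanding once it reaches depth q. The Python sketch below follows the convention in the text that the count includes node i itself (so the value is 1 for q = 0); the function name is hypothetical.

```python
from collections import deque

def neighborhood_size(adj, i, q):
    """Number of nodes at distance q or closer from node i, including i."""
    dist = {i: 0}
    queue = deque([i])
    while queue:
        u = queue.popleft()
        if dist[u] == q:
            # Nodes at depth q are counted but not expanded further.
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return len(dist)
```

Under this convention the value for q = 1 is the degree plus one, which ranks nodes in the same order as degree centrality.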



3.8 The Imprecision Functions 

We now define the imprecision functions from Kitsak et al. (2010) that are used to measure
the effectiveness of a centrality measure in identifying influential spreaders. We also extend
their definition to all centrality measures explored in this paper. Let N denote the number
of nodes, and let p be a real number between 0 and 100. The pN/100 highest-efficiency
spreaders, $\Upsilon_{eff}(p)$, are chosen based on the number of nodes infected, $M_i$, per node i. Similarly, a
set $\Upsilon_{k_s}(p)$ is defined as the pN/100 predicted most efficient spreaders, chosen with priority
to the nodes with the highest $k_s$ values. Let

$$M_{eff}(p) = \sum_{i \in \Upsilon_{eff}(p)} \frac{M_i}{pN/100} \qquad (7)$$

and

$$M_{k_s}(p) = \sum_{i \in \Upsilon_{k_s}(p)} \frac{M_i}{pN/100}. \qquad (8)$$

The imprecision function of $k_s$, $\epsilon_{k_s}(p)$, is defined as

$$\epsilon_{k_s}(p) = 1 - \frac{M_{k_s}(p)}{M_{eff}(p)}. \qquad (9)$$

Similarly, $\epsilon_{eig}(p)$ and $\epsilon_{deg}(p)$ are defined as

$$\epsilon_{eig}(p) = 1 - \frac{M_{eig}(p)}{M_{eff}(p)}, \qquad \epsilon_{deg}(p) = 1 - \frac{M_{deg}(p)}{M_{eff}(p)}. \qquad (10)$$

In general, for any centrality measure c, the imprecision function $\epsilon_c(p)$ is defined as

$$\epsilon_c(p) = 1 - \frac{M_c(p)}{M_{eff}(p)}. \qquad (11)$$
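
The imprecision function can be sketched in Python as one minus the ratio of the mean spread of the top-ranked nodes under a centrality measure to the mean spread of the truly best spreaders. The function name and input format are assumptions for this example, as are two implementation details: the set size pN/100 is rounded down with a minimum of one node, and ties are broken by the order in which nodes appear in the input.

```python
def imprecision(spread, score, p):
    """Imprecision of a centrality measure at percentage p.

    spread: dict node -> average number of nodes infected when that
            node is the initial infectee (M_i).
    score:  dict node -> centrality value for the same nodes.
    p:      percentage in (0, 100].
    """
    n = len(spread)
    top = max(1, int(p * n / 100))
    # The truly most efficient spreaders, ranked by simulated spread.
    best = sorted(spread, key=spread.get, reverse=True)[:top]
    # The predicted most efficient spreaders, ranked by centrality.
    predicted = sorted(score, key=score.get, reverse=True)[:top]
    m_eff = sum(spread[i] for i in best) / top
    m_c = sum(spread[i] for i in predicted) / top
    return 1.0 - m_c / m_eff
```

A value of 0 means the centrality measure picks spreaders that are as effective as the true top pN/100 nodes; larger values indicate a worse ranking.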



4 Experimental Setup 

In this section we describe our experimental setup and the datasets we used. All simulation
and centrality analysis was done in version 2.14.1 of R (R Development Core Team, 2011).
The operating system used was Windows Vista Enterprise (32-bit), and the computer had
an Intel Core 2 Quad CPU (Q9650) at 3.0 GHz with 4 GB of RAM. Run times to analyze
the networks ranged from several hours for the small networks to several days for the
larger ones. Centrality measures were computed using the igraph package (Csardi & Nepusz, 2006)
in R.

We obtained our datasets from a variety of sources. Brief descriptions of these networks 
are as follows: 
cond-mat-GCC is an academic collaboration network from the arXiv e-print archive and covers
scientific collaborations between authors of papers submitted to the Condensed Matter category
from 1999 (Newman, 2011).

ca-GrQc-GCC is an academic collaboration network from the arXiv e-print archive and covers
scientific collaborations between authors of papers submitted to the General Relativity and
Quantum Cosmology category from Jan. 1993 to Apr. 2003 (Leskovec, 2012).

urv-email is an e-mail network based on communications of members of the University
Rovira i Virgili (Tarragona) (Arenas, 2012). It was extracted in 2003.

1-edges-GCC is a network formed from YouTube, the video-sharing website that allows
users to establish friendship links (Zafarani & Liu, 2009). The sample was extracted in
Dec. 2008. Links represent two individuals sharing one or more subscriptions to channels
on YouTube.

std-GCC is an online sex community in Brazil in which links represent that one of the
individuals posted online about a sexual experience with the other individual, resulting in
a bipartite graph. The data was extracted from September 2002 to October 2008 (Rocha et al., 2010).

as20000102 is a one-day snapshot of Internet routers as constructed from the border
gateway protocol logs (Leskovec, 2012). It was extracted on Jan. 2nd, 2000.

oregon2_010331 is a network of Internet routers over a one-week period as inferred from
Oregon route-views, looking glass data, and routing registry data, covering the week of
March 31st, 2001 (Leskovec, 2012).

ca-HepTh-GCC is a collaboration network from the arXiv e-print archive and covers scientific
collaborations between authors of papers submitted to the High Energy Physics - Theory
category. It covers papers from Jan. 1993 to Apr. 2003 (Leskovec, 2012).

as-22July06 is a snapshot of the Internet on 22 July 2006 at the autonomous systems level,
compiled by Mark Newman (Newman, 2011).

netscience-GCC is a network of coauthorship of scientists working on network theory and
experiments, compiled by Mark Newman in May 2006 (Newman, 2011).

All datasets used in this paper were obtained from one of four sources: the ASU Social
Computing Data Repository (Zafarani & Liu, 2009), the Stanford Network Analysis
Project (Leskovec, 2012), Mark Newman's data repository at the University of Michigan
(Newman, 2011), and Universitat Rovira i Virgili (Arenas, 2012). All networks considered
were symmetric; i.e., if a directed edge from vertex v to v' exists, there is also an
edge from vertex v' to v. Summary statistics for these networks can be found in Table 1.

In the cases where the network had more than one component, we used only the greatest
(i.e., largest) connected component. We append the suffix "-GCC" when referring to those networks. For
example, the cond-mat network had more than one component, so we use its greatest
connected component and refer to this network as "cond-mat-GCC".

As seen in Table 1, all networks used are approximately scale-free. This does not imply
that they were generated using a preferential attachment model (as introduced in Barabási
& Albert (1999)), as many mechanisms can be responsible for generating scale-free
networks. If they were generated using a preferential attachment model, then we would see
a correlation between shell number and degree. This would also mean that degree centrality
and shell number would differ little in predicting spreaders, but our simulations
show otherwise. Figure 5 shows an example in which degree and shell number are not
correlated.

Name             Type     Nodes   Edges   Density   β'     λ     R²     ⟨k⟩    ⟨k²⟩     K_s
1-edges-GCC      online   13679   76741   0.0008    2.3    1.8   0.90   11.2   502.6    25
as20000102       router    6474   12572   0.0006    0.6    1.2   0.73    3.9   640.0    12
ca-GrQc-GCC      collab    4158   13422   0.0016    6.3    2.0   0.88    5.5    93.2    43
cond-mat-GCC     collab   13861   44619   0.0005    8.4    2.4   0.93    5.9    75.6    17
oregon2_010331   router   10900   31180   0.0005    0.5    1.2   0.79    5.7   1188.8   31
std-GCC          std      15810   38540   0.0003    3.7    1.9   0.92    4.7   130.9    11
urv-email        email     1133    5451   0.0085    5.7    1.5   0.84    9.6   179.8    11
ca-HepTh-GCC     collab    8638   24806   0.0007    8.3    2.2   0.90    5.7    74.6    31
as-22July2006    router   22963   48436   0.0002    0.4    1.2   0.72    4.2   1103.0   25
netscience-GCC   collab     379     914   0.0127   14.2    1.6   0.76    4.8    38.7     8

Table 1. Network Summary Statistics. Note that β' is the minimum threshold of the infection
rate for the epidemic to spread to a significant portion of the network (given here as a percentage), λ is the exponent of the
power law of the degree distribution, R² is the goodness of fit between the power law and the
degree distribution, ⟨k⟩ and ⟨k²⟩ are the first and second moments of the degree distribution,
and K_s is the maximum shell present in the network.



5 Results 

Earlier we noted that (1) the relative performance of degree, shell number and other centrality
measures can depend on the β parameter of the SIR model, and (2) eigenvector centrality
performs very well in general regardless of the value of β used, typically outperforming
all of the other centrality measures that we tried. Here we present more results illustrating
these two points. Unless otherwise specified, the β values that we used when plotting
the imprecision function versus β are 1.1β', 1.2β', ..., 2.0β', where β' is the epidemic
threshold for the network in question.

5.1 Sensitivity to β

In Figures 1 and 2, we saw that the performance of degree relative to shell number changes
with β for the cond-mat network. For β = 11.17, shell number was a better indicator
of spreading, but for β = 15.95, degree was better. Another way to depict
this dependence on β is to fix p and plot the imprecision versus β, instead of fixing β
and plotting the imprecision versus p. In Figure 6, we fix p = 5 and plot the imprecision
function of degree, shell number, and eigenvector centrality versus β, for β between 11.17
and 15.95. Notice that at around β = 14, degree begins to outperform shell number.

The relative performance of other centrality measures can change as well. In Figure
7, we plot the imprecision functions of degree, shell number, eigenvector, and closeness
centrality versus β for p = 5. In this network (ca-GrQc-GCC), for β near β', degree and shell number
perform very well. However, as β increases, the imprecision functions of those measures
increase, and other measures, like closeness and eigenvector, outperform degree and shell
number.

Fig. 5. Degree vs. k-shell number (panel shown: cond-mat-GCC). In the higher shells of these two examples, degree and shell number are not
correlated, indicating that these networks cannot be assumed to be generated by preferential attachment
models. The red line shows the average degree of each shell. Note that log scales are
used on both axes.

5.2 Eigenvector centrality 

As we saw in Figure 3, eigenvector centrality outperforms shell number for all but one 
of the networks we examined. Eigenvector centrality also typically outperforms all of the 
other centrality measures that we tried. In Figure 8, we plot the imprecision functions of 
several different centrality measures for the cond-mat network. We see that eigenvector 
centrality performs best for this network. In Figures 9, 10, 11, and 12, we give examples
of a collaboration network, an online network, an STD network, and an email network in
which eigenvector centrality performs best.

Eigenvector centrality did not outperform shell number for the ca-HepTh network, so we
cannot conclude that eigenvector centrality performs best for every network that we tried.
However, it does seem that, on average, for the networks we tried, eigenvector centrality
performed best for β = 1.1β', 1.2β', ..., 2.0β'. Suppose we take the imprecision functions
for β = 1.1β' for each network, and we average these imprecision functions over all of our
networks, including the ca-HepTh network. This would be one way to check how well each
centrality measure performs on average. In Figure 13, we plot the average imprecision
versus p for β = 1.1β'. We see that, on average, eigenvector centrality outperforms the
other measures. The measure nghd2 also performed well. We give similar figures for β =
1.5β' and β = 2.0β' in Figures 14 and 15. In both cases, eigenvector centrality outperforms
all of the other measures.

Fig. 6. Imprecision vs. β for the cond-mat-GCC network with p = 5. The relative performance of degree and
shell number changes near β = 14.

Fig. 7. Imprecision vs. β for the ca-GrQc-GCC network with p = 5.

Fig. 8. Imprecision vs. p for the cond-mat-GCC network with β = 1.1β' = 8.77. We see
that eigenvector centrality performs best for this network.

We believe that eigenvector centrality performs well for some of the same reasons that 
shell number performs well. A node has high eigenvector centrality when the node and 
its neighbors have high degree. Nghd2, nghd3, and the closely related measure of Chen 
et al. (2012) also perform well for this reason. A hub, or a node with high degree, in the 
periphery of a network, which does not have many neighbors with high degree, will not 
typically be as good a spreader as a node with high eigenvector centrality.



5.3 Large values of β

In Kitsak et al. (2010), only relatively small values for β were explored, as it was noted
that larger values of β would likely cause spreading to a large portion of the population
regardless of the location of the initially infected node. However, in the networks we
studied, we found a difference in the ability of the starting node to spread even at seven
times the epidemic threshold. Further, the result that eigenvector centrality performs best,
based on average imprecision over all the networks, still holds for these larger values of
β. We display our imprecision functions for larger values of β in Figure 16. We also show
that for five times the epidemic threshold, eigenvector centrality still outperforms the other
centrality measures for different values of p (Figure 17).

Fig. 9. Imprecision vs. p for the netscience-GCC network with β = 1.1β' = 15.67. We see
that eigenvector centrality performs best for this network.

Fig. 10. Imprecision vs. p for the 1-edges-GCC network with β = 1.1β' = 2.50. We see
that eigenvector centrality performs best for this network.

Fig. 11. Imprecision vs. p for the std-GCC network with β = 1.1β' = 4.01. We see that
eigenvector centrality performs best for this network.

Fig. 12. Imprecision vs. p for the urv-email network with β = 1.1β' = 6.22. We see that
eigenvector centrality performs best for this network.

Fig. 13. Average imprecision vs. p with β = 1.1β', where the average is taken over all
networks that we considered.

Fig. 14. Average imprecision vs. p with β = 1.5β', where the average is taken over all
networks that we considered. We see that, on average, eigenvector centrality performs best.

Fig. 15. Average imprecision vs. p with β = 2.0β', where the average is taken over all
networks that we considered. We see that, on average, eigenvector centrality performs best.

Fig. 16. Average imprecision vs. β with p = 5. We see that, on average, eigenvector centrality
performs best.

6 Conclusions and Future Work

These new experiments provide further insight into the problem of identifying spreaders in
complex networks, the study of which was initiated by Kitsak et al. (2010). We extended their work by
studying multiple values of the infection probability β and showed that the relative ability
of centrality measures to identify spreaders often depends on this parameter. We also
noted that eigenvector centrality consistently outperforms the other centrality measures,
usually independent of β. Future work on identifying influential spreaders could include
identifying nodes that not only cause significant spreading, but do so quickly, thus accounting
for the time it takes for individuals in the population to become infected. Further,
it would also be interesting to examine which centrality measures best identify spreaders in
non-monotonic models of diffusion processes, such as the voter model. Another direction for
future work would be to examine group centrality. In other words, one could use a centrality
measure on sets of nodes to identify the best set of spreaders under the SIR model (Moores
et al., 2012). Finally, it is also worth empirically studying centrality measures designed
specifically for the SIR model or other diffusion processes, as described in recent work such
as Klemm et al. (2012) and Kang et al. (2012). However, we note that one key advantage
of the approach taken in this paper is that the centrality measures studied are already well
established, and hence commonly available in many software tools for complex network analysis.